A bike sharing system is a service in which individuals, whether members or not, can rent a bicycle for a short period. The scheme is growing popular around the globe. As part of a Kaggle competition (https://www.kaggle.com/c/bike-sharing-demand), we aim to answer: "How can we fit a linear model to predict the number of bike rentals demanded by any kind of potential customer?" The bike sharing data (training and test files) is available on the Kaggle competition website and can be downloaded directly from: https://www.kaggle.com/c/bike-sharing-demand/data
At first sight the problem looks simple; the main challenge is to structure the data and extract the relevant information.
The goal of this exercise is to predict the exact number of bicycle rentals demanded, so that bike sharing providers can provision the system and users can plan their travels more easily. The data set provided by Kaggle is a snapshot of hourly rental data over two years (January 1, 2011 to December 31, 2012).
The training set covers the first 19 days of each month; the remaining days form the test set. We aim to predict the total hourly count of rented bikes in the test data. Date and time are provided as a single string feature that encodes several important pieces of information: the year, month, day of the week, and hour.
library(lubridate)  # date-time parsing: ymd_hms(), hour(), wday()
library(car)        # for scatterplots
library(chron)      # for is.weekend()
# read data (adjust the path to your local copy of the Kaggle files):
setwd("~/Dropbox (MAGI)/farnoush-vahid/r-workshop/data/")
train <- read.csv("train.csv", stringsAsFactors = FALSE)
test  <- read.csv("test.csv", stringsAsFactors = FALSE)
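As a quick sanity check of the split described above, the day of the month can be extracted from each timestamp; every training record should fall on days 1 to 19. A minimal base-R sketch on two hypothetical timestamps (with the real data, apply the same check to `train$datetime`):

```r
# Kaggle's split rule: days 1-19 of each month are training data,
# days 20+ are test data. Illustrated on two sample timestamps.
sample_times <- c("2011-01-19 23:00:00", "2011-01-20 00:00:00")
mday <- as.POSIXlt(sample_times, tz = "UTC")$mday  # day of month
in_train <- mday <= 19
# in_train: TRUE for the first timestamp, FALSE for the second
```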
The raw data set consists of 10,886 records; each observation has 12 attributes with no missing values:
#dataset dimension:
cat("Number of training rows: ", nrow(train), "\n")
cat("Number of training cols: ", ncol(train), "\n")
# column names:
names(train)
# review data:
head(train)
str(train)
# check for missing values in data:
table(is.na(train))
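`table(is.na(train))` only reports the overall number of missing cells; `colSums(is.na(...))` breaks the count down per column, which is more informative when some values are missing. A minimal sketch on a toy data frame standing in for `train`:

```r
# per-column missing-value counts; 'df' stands in for the train data frame
df <- data.frame(a = c(1, NA, 3), b = c("x", "y", NA))
na_counts <- colSums(is.na(df))
# na_counts: one NA in column a, one in column b
```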
The raw data features and their descriptions are listed in Table I:
| Feature | Description |
|---|---|
| 1. Date and hour | provided as a single string |
| 2. Season | coded 1 to 4 (1: spring, 2: summer, 3: fall, 4: winter) |
| 3. Holiday | a binary variable indicating whether the day is a holiday |
| 4. Working day | a binary variable indicating whether the day is neither a weekend nor a holiday |
| 5. Weather | a categorical variable with 4 levels: 1: clear, few clouds, partly cloudy; 2: mist, cloudy, broken clouds; 3: light rain, light snow, scattered clouds; 4: heavy rain, heavy snow, thunderstorm, fog |
| 6. Temp | hourly temperature in Celsius, continuous in the range [0.8, 41] |
| 7. aTemp | "feels like" temperature in Celsius, in the range [0.7, 46] |
| 8. Humidity | relative humidity in percent, in the range [0, 100] |
| 9. Windspeed | continuous, in the range [0, 57] |
| 10. Casual | number of bike rentals made by non-registered users (Y) |
| 11. Registered | number of bike rentals made by registered users (Y) |
| 12. Count | sum of the casual and registered values; the target we intend to predict (Y) |
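The last three rows of the table imply the invariant count = casual + registered, which is worth verifying before modeling. A sketch on a toy data frame (run the same `stopifnot` on the real `train` data):

```r
# verify that count equals casual + registered on a toy frame
toy <- data.frame(casual = c(3, 0, 5), registered = c(13, 8, 27))
toy$count <- c(16, 8, 32)
stopifnot(all(toy$count == toy$casual + toy$registered))
```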
To understand the distribution of the variables, we plot a frequency histogram or density curve for each numeric variable and examine how the total demand count (the target feature) behaves against each one.
# plot frequency histograms and kernel density estimates:
par(mfrow = c(4, 3))
par(mar = rep(3, 4))
# filled density plot of season
d <- density(train$season)
plot(d, main = "Kernel Density of season")
polygon(d, col = "green", border = "green")
hist(train$weather, breaks = 12, col = "red")
# filled density plot of weather
d <- density(train$weather)
plot(d, main = "Kernel Density of weather")
polygon(d, col = "blue", border = "blue")
# kernel density of humidity
d <- density(train$humidity)
plot(d, col = "red")
hist(train$holiday, breaks = 12, col = "green")
hist(train$workingday, breaks = 12, col = "blue")
# kernel density of temperature
d <- density(train$temp)
plot(d, col = "blue3")
hist(train$atemp, breaks = 12, col = "blue1")
hist(train$windspeed, breaks = 12, col = "red")
After diving deeper into the data set, we hypothesize that the combined date-and-time feature makes the data difficult to analyze. We therefore convert it into four separate variables (month, day of week, hour of day, and a weekend flag) to expose more information and improve prediction accuracy. We then plot the trend of count over these new features to check whether the hypothesis holds.
########## feature engineering ##########
# decompose the datetime string into hour, day of week, month and a weekend flag:
datetime <- ymd_hms(train$datetime)
train$datetime <- datetime
train$hour <- hour(datetime)
# day of week as an integer 1-7 with Monday first
# (the labels must match the abbreviations returned by your lubridate version)
train$day <- as.integer(factor(wday(datetime, label = TRUE),
                               levels = c("Mon", "Tues", "Wed", "Thurs",
                                          "Fri", "Sat", "Sun"),
                               ordered = TRUE))
# month as an integer 1-12 (month.name is R's built-in vector of month names)
train$month <- as.integer(factor(months(datetime),
                                 levels = month.name, ordered = TRUE))
# weekend flag: 1 = weekend, 2 = weekday
train$weekend <- as.integer(factor(is.weekend(datetime),
                                   levels = c("TRUE", "FALSE"),
                                   ordered = FALSE))
# plot frequency diagrams for new features:
par(mfrow=c(4,2))
par(mar = rep(2, 4))
hist(train$hour, breaks=12, col="red")
hist(train$day, breaks=12, col="blue")
hist(train$month, breaks=12, col="green")
hist(train$weekend, breaks=12, col="blue")
The graphs above support our hypothesis. As the histograms show, the season feature has four categories with nearly identical distributions, and the clear or partly cloudy weather condition is far more frequent than the others. Furthermore, noticeably more rentals occur on working days than on weekends and holidays.
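The per-feature demand trends mentioned above can be summarized with `tapply()` before plotting. A minimal sketch on synthetic data (with the real data, use `train$hour` and `train$count`):

```r
# average demand per hour of day, on a toy stand-in for the train data
toy <- data.frame(hour  = c(8, 8, 17, 17, 3),
                  count = c(300, 340, 500, 460, 20))
hourly_mean <- tapply(toy$count, toy$hour, mean)  # mean count per hour
# barplot(hourly_mean, xlab = "hour of day", ylab = "mean count")
```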
Next, a linear regression model is used to predict the exact number of rental bike demands. The model is fitted on two different feature combinations, the raw data and the data with the engineered features, to find the better-fitting model.
Fitting a linear model on the raw features gives the following result:
# compare linear models fitted on the raw features vs. the engineered features
lm1 <- lm(count ~ ., data = train[c("season", "holiday", "workingday",
                                    "weather", "temp", "atemp",
                                    "humidity", "windspeed", "count")])
summary(lm1)
plot(lm1, col = "green")
and fitting it on the data with the new features yields the charts below:
lm2 <- lm(count ~ ., data = train[c("season", "holiday", "workingday",
                                    "weather", "temp", "atemp",
                                    "humidity", "windspeed",
                                    "hour", "day", "month", "count")])
summary(lm2)
plot(lm2, col = "blue")
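Kaggle evaluates this competition with the root mean squared logarithmic error (RMSLE), so a consistent way to compare `lm1` and `lm2` is to score their training-set predictions with that metric (an optimistic estimate, since it reuses the training data). A sketch of the metric; negative linear-model predictions are clipped at zero before taking logs:

```r
# RMSLE: sqrt(mean((log(1 + pred) - log(1 + actual))^2))
rmsle <- function(actual, predicted) {
  predicted <- pmax(predicted, 0)  # clip negative predictions
  sqrt(mean((log1p(predicted) - log1p(actual))^2))
}
# rmsle(train$count, predict(lm1, train))
# rmsle(train$count, predict(lm2, train))
```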